Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam around the island, they loot for weapons and items crucial for their survival. Players can choose to join a game as a solo player or with a group of friends (4 players maximum). When playing solo, players are immediately eliminated when they are killed. However, in group play, killed individuals can be revived by their teammates.
We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.
Through our analysis, we aim to understand what characterizes winning players or teams: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions will be of high interest for the PUBG gaming community.
The main goal of this project is to predict a player's finish placement based on their in-game actions. Specifically, the three subquestions of interest are the ones posed above: how aggressive the winners' playing styles are, whether landing in a densely or sparsely populated area pays off, and whether players who travel farther tend to place higher.
The data comes from the Kaggle competition PUBG Finish Placement Prediction.
data.url <- "https://www.dropbox.com/s/mp89gp57cz2dsc7/train_V2.csv.zip?dl=1"
if(!dir.exists("./data")) dir.create("./data") # Ensure the data directory exists
if(!file.exists("./data/train_V2.csv.zip")){
  download.file(data.url, destfile = "./data/train_V2.csv.zip", mode = "wb")
}
# Warning: Large dataset (628 MB), will take a minute or so to read.
raw_dat <- read_csv("./data/train_V2.csv.zip")
clean_dat = raw_dat %>%
clean_names() %>%
drop_na(win_place_perc) # Drop rows without outcome variable
Each row in the data contains one player’s post-game stats. A description of all data fields is provided in data/pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 16% of the data, with 720,386 observations. The outcome variable we are trying to predict is win_place_perc.
solo_dat <- clean_dat %>%
#sample_n(10000) %>%
filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>% # Remove features that are not relevant to solo mode
mutate(kill_points = ifelse(rank_points == -1 & kill_points == 0, NA, kill_points), # Following codebook explanations
win_points = ifelse(rank_points == -1 & win_points == 0, NA, win_points),
rank_points = ifelse(rank_points == -1, NA, rank_points),
id = as.factor(id),
match_id = as.factor(match_id)) %>%
select(-rank_points) # Variable is being deprecated
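As a quick illustration, the -1 sentinel recoding above behaves as follows on made-up vectors (toy values, not real data):

```r
# Per the codebook, when rank_points == -1, a kill_points (or win_points)
# value of 0 really means "missing", not zero.
rank_points <- c(1500, -1, -1, 1200)
kill_points <- c(1000,  0, 800,    0)

kill_points_clean <- ifelse(rank_points == -1 & kill_points == 0, NA, kill_points)
rank_points_clean <- ifelse(rank_points == -1, NA, rank_points)

kill_points_clean  # 1000   NA  800    0
rank_points_clean  # 1500   NA   NA 1200
```

Note that a nonzero kill_points (800) survives even when rank_points is -1; only the ambiguous zeros become NA.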
We are given a training set and a test set. The outcome variable for the test set will not be provided until the end of the Kaggle competition on Jan. 30, 2019. Therefore, for the purposes of this project, we will only be using the provided training set. Within the provided training set, we will create our own "training" (80%) and "test" (20%) sets. For the rest of the document, the training set we refer to is the one we've created.
# Split into train and test set
set.seed(1)
train_ind = createDataPartition(y = solo_dat$win_place_perc, p = 0.8, list = F)
train_solo = solo_dat %>%
slice(train_ind)
test_solo = solo_dat %>%
slice(-train_ind)
In the training set, we have 576,310 players and 8,071 matches.
# Compute proportions
prop_data = train_solo %>%
group_by(match_id, max_place, match_duration) %>%
count() %>%
ungroup() %>%
mutate(prop = n/max_place,
remove_game = prop > 1)
# Games with proportion greater than 100%
prop_over_100 = prop_data %>%
summarize(prop_n = sum(prop > 1),
prop_games = prop_n/n())
# Histogram
prop_data %>%
ggplot(aes(x = prop)) +
geom_histogram(bins = 30, color = 'white') +
labs(title = "Proportion of players we have data for in a game",
x = "Proportion",
y = "Count") +
theme_minimal()
# Remove games with proportion greater than 100%
remove_match_ids = prop_data %>%
filter(remove_game) %>%
pull(match_id)
train_solo = train_solo %>%
filter(!(match_id %in% remove_match_ids))
For most games, we have between 70% to 90% of the players’ data, using max_place (the worst placement for which we have data) as a proxy for total number of players. For 14 games (0.17% of all games), we have more observations than max_place, which is not possible. Thus, we excluded these games from our analysis.
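The flagging logic described above can be illustrated on made-up match counts:

```r
# Toy version of the max_place sanity check: a match reporting
# max_place = 10 but with 12 observed players gets flagged.
toy_games <- data.frame(match_id  = c("g1", "g2", "g3"),
                        n         = c(8, 12, 90),
                        max_place = c(10, 10, 100),
                        stringsAsFactors = FALSE)
toy_games$prop        <- toy_games$n / toy_games$max_place
toy_games$remove_game <- toy_games$prop > 1
toy_games$match_id[toy_games$remove_game]  # "g2"
```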
We first explored the distribution of each feature by the final finish percentile. Players were first grouped into the 0-19th, 20th-39th, 40th-59th, 60th-79th, or 80th-100th percentile finish. Then we plotted the density of features by percentile groups. Note that due to extreme outliers, we excluded the highest 1% of many of the features for clearer visualizations.
set.seed(1)
filter_vars = c("boosts", "damage_dealt", "headshot_kills", "heals", "kills", "longest_kill", "ride_distance", "swim_distance", "walk_distance", "weapons_acquired")
train_solo %>%
filter_at(vars(filter_vars), all_vars(. < quantile(., 0.99, na.rm = T))) %>% # Remove outliers
rename_at(vars(filter_vars), ~ paste0(.x, "*")) %>% # Mark variables from which we removed outliers with an asterisk
mutate(win_place_cat = floor(win_place_perc / 0.2),
win_place_cat = ifelse(win_place_cat == 5, 4, win_place_cat),
win_place_cat = as.factor(win_place_cat)) %>%
gather("feature", "value", -match_id, -match_duration,
-id, -win_place_perc, -win_place_cat) %>%
ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
facet_wrap(feature ~., scales = "free") +
geom_density() +
labs(title = "Distribution of Features by Finish Percentile",
caption = "* Removed outliers (> 99th percentile) from this feature's density plot",
x = "Value of Features", y = "Density", color = "Percentile") +
scale_color_manual(labels = c("0-19", "20-39", "40-59", "60-79", "80-100"),
values = brewer.pal(5, "OrRd")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Note: the "non-finite values" warning comes from the NA values we introduced into kill_points and win_points; we still need to decide how to handle these.
Some interesting relationships between the features and the finish percentile:
Use of Items (boosts, heals, and weapons_acquired): Players who finish higher tend to have used more boosts and healing items, and acquired more weapons. This is expected since they stayed in the game for a longer period and have more time to collect and use items. However, it would be interesting to explore which of these variables is most predictive of a high finish placement.
Kills & Damage (damage_dealt, kill_place, and kills): Players who finish higher tend to have more kills. They also tend to have dealt more damage. However, in the top finishing group, there is a wide variety in how much damage they inflict. This could potentially indicate strategies that differ in their level of aggressiveness during the course of the game but are similarly successful in achieving a high placement.
Distance Traveled (walk_distance, swim_distance, and ride_distance): Players who finish higher tend to have walked farther. This is likely because they simply survive longer and are forced to travel to stay in the safe zone, whereas players who die early don't get a chance to travel very far. Both swimming and riding in vehicles are rare occurrences, though it appears that players who finish higher also tend to do more of both.
(Additional notes to ourselves regarding some features of the data we might want to look at):
kill_place, kill_points, kills, and win_points follow bimodal distributions. This may reflect the play styles of each player. Players who land in populated areas are more likely to encounter other players, resulting in a higher probability of dying, or a larger number of kills if the player survives. Thus, we can partition players in the 10th percentile finish into two categories: a skilled player who dies early because they dropped in a populated location, but who acquires a large number of kills due to their skill; or a less-skilled player who dies early due to lack of skill despite dropping in a less populated location.
Many features have long right tails (longest_kill, ride_distance, swim_distance, etc.). We may want to log-transform these variables in our model building.
The num_groups density plots suggest that in games where we have little data, we tend to have data on the winners. Thus, there may be some imbalance in the data that we will need to adjust for, to ensure that our model doesn't overestimate finish percentile.
rank_points, win_points, and kill_points are external characteristics (from previous games) that attempt to characterize the skill level of a player. Their distributions are bimodal, which may reflect the extremes of the two play styles described above. It seems that kill_points has more predictive value for finish percentile, as the right shift is more distinct by finish percentile category than for rank_points. Interestingly, rank_points suggests that prior-game ranks do not have a large impact on the final placement in a game (though there is a note in the pubg_codebook.csv file that this metric is deprecated). This makes sense, since in-game variables like drop location, loot, and circle movement can affect how likely an individual is to win.
Statistics related to kills seem to be well correlated with finish percentile. Additionally, the duration of a game does not seem to be strongly correlated with many of the in-game features such as kills, walk_distance, etc.
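As a sketch of the log-transform idea noted above, base R's log1p handles the zero-heavy, right-skewed features without producing -Inf at zero (made-up values):

```r
# log1p(x) = log(1 + x), so zeros map to zero and large values are compressed.
x <- c(0, 10, 100, 10000)  # toy distance-like values, e.g. ride_distance
log1p(x)
```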
corr_matrix = train_solo %>%
select(-id, -match_id) %>%
cor(use = "pairwise.complete.obs") # kill_points and win_points contain NAs
corrplot(corr_matrix, method = "color", type = "upper")
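To turn the correlation plot into a ranking, we could sort the outcome column of the correlation matrix. A sketch on simulated data (in the report this would index corr_matrix itself rather than the toy matrix):

```r
# Rank toy features by their correlation with the outcome; walk_distance is
# simulated with less noise than kills, so it should rank higher here.
set.seed(1)
toy <- data.frame(win_place_perc = runif(200))
toy$walk_distance <- toy$win_place_perc + rnorm(200, sd = 0.2)
toy$kills         <- toy$win_place_perc + rnorm(200, sd = 0.5)
cm <- cor(toy)
sort(cm[-1, "win_place_perc"], decreasing = TRUE)  # walk_distance ranks first
```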
We will fit the following models:
First, we fit a linear regression model with all features to get a baseline predictive accuracy we are able to achieve without tuning any parameters or using more complicated models.
set.seed(1)
tc = trainControl(method = "cv", number = 5)
lr_model = train(win_place_perc ~ ., data = select(train_solo, -id, -match_id), method = "lm", trControl = tc) # Drop ID columns; they are identifiers, not features
summary(lr_model)
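When we later score the baseline on our held-out test set, the RMSE computation is straightforward. A self-contained sketch on simulated data (the real call would use predict(lr_model, newdata = test_solo) against test_solo$win_place_perc):

```r
# Toy train/test RMSE for a linear model; the names and sizes are placeholders.
set.seed(1)
toy   <- data.frame(x = runif(100))
toy$y <- 0.5 * toy$x + rnorm(100, sd = 0.1)
fit   <- lm(y ~ x, data = toy[1:80, ])          # fit on the first 80 rows
pred  <- predict(fit, newdata = toy[81:100, ])  # score the held-out 20
rmse  <- sqrt(mean((toy$y[81:100] - pred)^2))
rmse  # roughly the noise sd (0.1) on this toy data
```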
Next, we try an elastic net, which combines a ridge-like penalty (to handle correlated features) with a lasso penalty (to perform variable selection).
enet_model = train(win_place_perc ~ ., data = select(train_solo, -id, -match_id), method = "glmnet", trControl = tc)
enet_model
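If the default glmnet grid proves too coarse, we could pass an explicit grid via train()'s tuneGrid argument. The values below are placeholders, not tuned choices:

```r
# Hypothetical elastic-net tuning grid: alpha mixes ridge (0) and lasso (1);
# lambda sets the overall penalty strength.
enet_grid <- expand.grid(alpha  = c(0, 0.5, 1),
                         lambda = 10^seq(-4, -1, length.out = 5))
nrow(enet_grid)  # 15 candidate (alpha, lambda) pairs
```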
rf_model = train(win_place_perc ~ ., data = select(train_solo, -id, -match_id), method = "rf", trControl = tc)
rf_model
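caret's method = "rf" will be very slow on ~576k rows; one pragmatic option (our assumption, not a settled plan) is to tune on a random subsample first and refit the chosen model on the full training set:

```r
# Draw a random subsample for tuning; shown on a placeholder data frame
# standing in for train_solo, with a made-up subsample size.
set.seed(1)
placeholder <- data.frame(win_place_perc = runif(1000),
                          kills          = rpois(1000, lambda = 1))
tune_sub <- placeholder[sample(nrow(placeholder), 200), ]
nrow(tune_sub)  # 200
```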
Comparison of model accuracies with a table or a plot. What is the best accuracy we are able to achieve (with cross-validation)? What is that model's accuracy on the test set?
There are many possible ways to evaluate accuracy. We can use MSE, or we can count the number of games for which we correctly predicted the winner. I think this section is probably going to be the least interesting one, so we don't have to spend too much time here.
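The "correctly predicted the winner" metric could be computed per match as sketched below, on made-up predictions:

```r
# Within each match, check whether the player with the highest predicted
# win_place_perc is the actual winner (actual win_place_perc == 1).
toy <- data.frame(match_id  = c("a", "a", "a", "b", "b"),
                  actual    = c(1.0, 0.5, 0.0, 0.0, 1.0),
                  predicted = c(0.9, 0.6, 0.1, 0.7, 0.4))
winner_correct <- tapply(seq_len(nrow(toy)), toy$match_id, function(i) {
  toy$actual[i][which.max(toy$predicted[i])] == 1
})
mean(unlist(winner_correct))  # 0.5: winner found in match "a" but not "b"
```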
We can try the following and see which one gives us a better story:
We can do PCA, plot observations along principal components, color points by outcome, and then try to interpret the top principal components in terms of the features.
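A minimal PCA sketch with base R's prcomp (toy feature matrix; the real call would use the numeric columns of train_solo, and the column names here are just placeholders):

```r
# PCA on centered, scaled toy features; the "Proportion of Variance" row
# tells us how much each principal component explains.
set.seed(1)
toy_feats <- matrix(rnorm(300), ncol = 3,
                    dimnames = list(NULL, c("kills", "walk_distance", "boosts")))
pca <- prcomp(toy_feats, center = TRUE, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]  # sums to 1 across PCs
```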
Manually summarize the features into something like the 3-4 categories described in the exploratory data analysis section, i.e. engineer a summary feature for "damage/kills," another summary feature for "distance traveled." Consult the correlation plot and PCA plot to get good groupings as well. Then plot observations along these summary features and color points by outcome. This is basically the same idea as PCA, except we already have a story in place for what the components mean.
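A sketch of such hand-engineered summary features; the variable groupings and the simple z-score sums below are placeholders, not a settled recipe:

```r
# Combine related features into "aggression" and "mobility" scores by
# summing standardized columns (made-up values standing in for train_solo).
toy <- data.frame(kills         = c(0, 3, 1, 5),
                  damage_dealt  = c(50, 400, 120, 600),
                  walk_distance = c(100, 2500, 800, 3000),
                  ride_distance = c(0, 1000, 0, 2000))
toy$aggression <- as.numeric(scale(toy$kills) + scale(toy$damage_dealt))
toy$mobility   <- as.numeric(scale(toy$walk_distance) + scale(toy$ride_distance))
toy$aggression  # z-score sum, centered at 0
```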
Which features are most predictive? I think we will have a better idea of this after we run the models and do clustering. Based on what we know, we can summarize what we have learned, with some additional evidence.